Cut-GAR: solution to determine cut-off point in cloud storage system
SHAO Tian, CHEN Guangsheng, JING Weipeng
Journal of Computer Applications    2015, 35 (9): 2497-2502.   DOI: 10.11772/j.issn.1001-9081.2015.09.2497
To address the poor performance caused by the vague definition of small files in the Hadoop Distributed File System (HDFS), Cut-off Point via Grey Relational Analysis (Cut-GAR) was proposed to locate the cut-off point between small files and large files. First, the relationships between file size and three factors, namely the memory consumed by the NameNode (M), the speed in MB of Uploaded Files per Second (MUFS), and the speed in MB of Accessed Files per Second (MAFS), were analyzed, and the file size best suited to each factor was determined as FM, FMUFS and FMAFS respectively. Then, grey relational analysis was applied to weight the influence of the three factors on file size: file size was treated as the evaluated object, M, MUFS and MAFS served as the evaluation indexes, and the weight of each index and the relational degree between index and object were computed. The approximate optimal cut-off point was obtained as the sum of FM, FMUFS and FMAFS, each multiplied by its corresponding index weight. Experimental results demonstrate that Cut-GAR achieves a balance among M, MUFS and MAFS, which improves the performance of small-file processing.
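The weighting step described in the abstract can be illustrated with a minimal sketch (not the authors' code): the grey relational degree of each index series to the file-size series is computed with range normalization and the common distinguishing coefficient ρ = 0.5, the degrees are normalized into weights, and the cut-off point is the weighted sum of the candidate sizes FM, FMUFS and FMAFS. All function names and the specific normalization are illustrative assumptions.

```python
def grey_relational_degree(ref, cmp, rho=0.5):
    """Grey relational degree of a comparison series (an index such as M,
    MUFS or MAFS) to the reference series (file size), in (0, 1]."""
    def norm(s):
        # range-normalize a series to [0, 1]
        lo, hi = min(s), max(s)
        return [(x - lo) / (hi - lo) for x in s]
    r, c = norm(ref), norm(cmp)
    deltas = [abs(a - b) for a, b in zip(r, c)]
    dmin, dmax = min(deltas), max(deltas)
    # grey relational coefficient at each point, averaged into one degree
    coeffs = [(dmin + rho * dmax) / (d + rho * dmax) for d in deltas]
    return sum(coeffs) / len(coeffs)

def cutoff_point(candidates, degrees):
    """Weighted sum of the candidate cut-off sizes (FM, FMUFS, FMAFS);
    weights are the relational degrees normalized to sum to 1."""
    total = sum(degrees)
    weights = [d / total for d in degrees]
    return sum(w * f for w, f in zip(weights, candidates))
```

For example, with candidate sizes 4, 6 and 8 MB and relational degrees 1, 1 and 2, the weights are 0.25, 0.25 and 0.5 and the cut-off point is 6.5 MB.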
Implementation of decision tree algorithm dealing with massive noisy data based on Hadoop
LIU Yaqiu, LI Haitao, JING Weipeng
Journal of Computer Applications    2015, 35 (4): 1143-1147.   DOI: 10.11772/j.issn.1001-9081.2015.04.1143

Considering that current decision tree algorithms seldom account for the level of noise in the training set, and that traditional memory-resident algorithms have difficulty processing massive data, an Imprecise Probability C4.5 algorithm named IP-C4.5 was proposed based on Hadoop. During training, IP-C4.5 assumed that the training set used to build the decision tree is unreliable, and adopted the imprecise-probability information gain rate as the split criterion to reduce the influence of noisy data on the model. To enhance its ability to handle massive data, IP-C4.5 was implemented on Hadoop with MapReduce programming based on file splits. The experimental results show that when the training set is noisy, IP-C4.5 achieves higher accuracy than C4.5 and Complete CDT (CCDT), performing especially well when the noise level exceeds 10%; moreover, the parallelized Hadoop implementation is capable of processing massive data.
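The split criterion mentioned in the abstract can be sketched as follows. This is not the authors' implementation: it is a simplified illustration of an imprecise-probability information gain built on the Imprecise Dirichlet Model, where the extra mass s is shared among the least-frequent classes to obtain the maximum-entropy distribution in the credal set (a simplification of the usual procedure). Function names and the choice s = 1 are assumptions.

```python
import math

def max_entropy_probs(counts, s=1.0):
    """Approximate maximum-entropy distribution in the IDM credal set:
    the imprecision mass s is shared equally by the least-frequent classes."""
    m = min(counts)
    mins = [i for i, c in enumerate(counts) if c == m]
    adj = [c + (s / len(mins) if c == m else 0.0) for c in counts]
    total = sum(counts) + s
    return [a / total for a in adj]

def imprecise_entropy(counts, s=1.0):
    """Shannon entropy (bits) of the maximum-entropy credal distribution."""
    return -sum(p * math.log2(p) for p in max_entropy_probs(counts, s) if p > 0)

def imprecise_gain(parent_counts, child_counts, s=1.0):
    """Imprecise-probability information gain for one candidate split:
    parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    child_term = sum(sum(c) / n * imprecise_entropy(c, s) for c in child_counts)
    return imprecise_entropy(parent_counts, s) - child_term
```

Note that even a perfectly separating split keeps a positive child entropy under this criterion, because small leaves remain uncertain under the credal set; this is what penalizes splits driven by a few noisy examples.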
